Introduction to Open Data Science - Course Project

About the project

The following documents my progress in the course Introduction to Open Data Science. The course includes basic R, RStudio, RMarkdown and GitHub usage, as well as several more advanced topics. I am saving all the weekly assignments and the course diary in a GitHub repository available here.

Weekly Assignments

Week 1

## [1] "Mon Nov 23 16:27:06 2020"

I have some skills in statistical methods and in the use of R, but I am unfamiliar with GitHub (except as a place for plundering code). I have done most of my work with longitudinal register data and parametric methods, and, based on the course description, I hope that the course will provide me with new knowledge on:

  • Open Data repositories
  • Survey methods
  • New methods (e.g. Cluster analysis).

But that’s probably enough for now; more to follow during the next weeks.


Week 2

Introduction

In this week’s assignment I am looking into the association between students’ learning approaches and their learning outcomes. The assignment is based on data collected during a statistics course. The data come from a questionnaire of 32 items measuring three different dimensions of learning. In addition, information is available on the participants’ global attitude towards statistics, gender, age and course exam points. More information on the data can be found here. The data I am using in this assignment is a modified version of the original data (see below).

#Load packages
library(tidyverse)
library(GGally)


#Read in the data
setwd("~/IODS-project/data")
learning_2014 <- read.table("learning_2014.txt")

Data

str(learning_2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
head(learning_2014)
##   age gender attitude points     deep  stra     surf
## 1  53      F      3.7     25 3.583333 3.375 2.583333
## 2  55      M      3.1     12 2.916667 2.750 3.166667
## 3  49      F      2.5     24 3.500000 3.625 2.250000
## 4  53      M      3.5     10 3.500000 3.125 2.250000
## 5  49      M      3.7     22 3.666667 3.625 2.833333
## 6  38      F      3.8     21 4.750000 3.625 2.416667
summary(learning_2014)
##       age           gender             attitude         points     
##  Min.   :17.00   Length:166         Min.   :1.400   Min.   : 7.00  
##  1st Qu.:21.00   Class :character   1st Qu.:2.600   1st Qu.:19.00  
##  Median :22.00   Mode  :character   Median :3.200   Median :23.00  
##  Mean   :25.51                      Mean   :3.143   Mean   :22.72  
##  3rd Qu.:27.00                      3rd Qu.:3.700   3rd Qu.:27.75  
##  Max.   :55.00                      Max.   :5.000   Max.   :33.00  
##       deep            stra            surf      
##  Min.   :1.583   Min.   :1.250   Min.   :1.583  
##  1st Qu.:3.333   1st Qu.:2.625   1st Qu.:2.417  
##  Median :3.667   Median :3.188   Median :2.833  
##  Mean   :3.680   Mean   :3.121   Mean   :2.787  
##  3rd Qu.:4.083   3rd Qu.:3.625   3rd Qu.:3.167  
##  Max.   :4.917   Max.   :5.000   Max.   :4.333

The modified data used in this assignment contain 7 variables on 166 students. Participants’ age (17-55 years, mean 25.5 years), gender (male/female) and course exam score (7-33 points, mean score 22.7) are recorded in the corresponding variables. The variables deep, stra and surf measure the dimensions of learning approaches and study skills: deep, strategic and surface learning. I have constructed these variables from the original questionnaire by taking the mean of all the questions measuring a specific dimension, so their possible values range between 1 and 5. The variable attitude measures the student’s global attitude towards statistics. It was originally a sum variable of 10 items, which I have scaled back to the original 1-5 scale of the questions by taking the mean of the items.
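The averaging step described above can be sketched with a toy example (the item columns q1-q4 below are hypothetical placeholders; the real data use the item names of the original questionnaire):

```r
# Toy illustration of constructing a dimension score as the mean of
# its items, which keeps the result on the original 1-5 Likert scale.
# Columns q1-q4 are hypothetical placeholders for the real item names.
raw <- data.frame(q1 = c(4, 2), q2 = c(5, 3), q3 = c(3, 3), q4 = c(4, 4))

raw$deep <- rowMeans(raw[, c("q1", "q2", "q3", "q4")])
raw$deep
```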

Exploratory analysis with data visualization

Let’s have a look at the data with the function ggpairs:

#Lower panel adds the regression lines;
#upper panel scales the text in the
#correlation boxes
ggpairs(learning_2014,
        aes(
          col=gender,alpha=0.5),
        lower = list(continuous =
                       wrap("smooth")),
        upper = list(continuous =
                       wrap("cor", size =
                              2)))
Figure 1: Graphical overview of the Learning 2014 data

Figure 1 summarizes the data. On the diagonal of the plot matrix, there are density plots for the continuous variables and bar plots for the discrete gender variable; the diagonal thus shows the distribution of each variable. The upper triangle contains the correlations between the continuous variables and a boxplot of each variable by gender. The lower triangle contains scatter plots for the continuous variables and histograms of these variables by gender.

There are more females in the data, and they are slightly younger than the males. There are indications of slight gender differences, especially in the attitude and surface learning variables (see the density plots and boxplots). However, since I am mostly interested in the association between learning approaches and outcomes (exam score), I will discard gender differences to make the plot more readable (Figure 2).

ggpairs(learning_2014 %>%
          select(c("attitude",
                   "surf",
                   "deep",
                   "stra",
                   "points")),
        lower = list(continuous =
                       wrap("smooth")),
        upper = list(continuous =
                       wrap("cor", size =
                              2)))
Figure 2: Graphical overview of the Learning 2014 data

Figure 2 indicates that the strongest predictor of a student’s exam score is the student’s attitude towards statistics. The association is linear and positive: the higher the attitude, the higher the exam score. Of the learning approaches, the strategic dimension seems to be positively and surface learning negatively associated with the exam score. Based on the correlation coefficients, both of these associations are relatively weak. Deep learning seems not to be correlated with the exam score. It also seems to be distributed less evenly than the other dimensions: fewer students have low scores on this dimension.

Regression analysis

Based on the investigation above, I assume that the exam score is associated with attitude, strategic and surface learning. I will examine this association by fitting a linear regression model:

model1 <-
  lm(points ~ 
       attitude + stra + surf,
     data=learning_2014)
summary(model1)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning_2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The model output confirms the reasoning based on the plots above. Attitude has a positive association with exam score, as does a higher score in strategic learning, whereas a high score on surface learning decreases the exam score. The multiple R-squared suggests that these three variables explain 21% of the variation in exam score. However, neither the strategic nor the surface learning coefficient is statistically significantly different from 0 at the conventional 95% confidence level, i.e. the p-value of the test that the coefficient equals 0 is larger than 0.05. Thus, I exclude these variables, and my final model includes only attitude as an explanatory variable.

model2 <-
  lm(points ~ 
       attitude,
     data=learning_2014)
summary(model2)
## 
## Call:
## lm(formula = points ~ attitude, data = learning_2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Based on the summary of the model, attitude is statistically significantly associated with exam score. On average, a one-point increase in attitude increases the exam score by around 3.5 points. The intercept estimates the points of a (hypothetical) individual with an attitude score of 0 (around 12 points). Multiple R-squared shows that 19% of the variation in exam score is explained by differences in the global attitude towards statistics.
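As a quick sanity check of this interpretation, the fitted line can be evaluated by hand from the coefficients reported in the summary above:

```r
# Predicted exam points implied by model2's reported coefficients
intercept <- 11.6372
slope     <- 3.5255  # points per one-unit increase in attitude

# Predictions at attitude values 2, 3 and 4
round(intercept + slope * c(2, 3, 4), 1)
```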

Figure 3 shows the basic scatter plot of exam scores and attitude, with the fitted regression line, in the upper left corner. In the other panels, I have plotted the basic diagnostic plots: Residuals vs. Fitted values (upper right), Normal Q-Q plot (lower left) and Residuals vs. Leverage (lower right).

par(mfrow=c(2,2))
plot(y=learning_2014$points,
     x=learning_2014$attitude)
abline(model2)
plot(model2,which=c(1,2,5))
Figure 3: Diagnostic plots

The basic assumption of a linear regression model is that the relationship between the variables is linear. From the scatter plot, this assumption seems to hold, although there are some outlying observations with low test scores and high attitude scores. The linearity assumption is also (mostly) supported by the residuals vs. fitted values plot: if it did not hold, we would expect to see a systematic pattern in the residuals. There do seem to be a few outliers with large residuals.

The second assumption of the model is homoscedasticity of the residuals, meaning that the variance of the residuals is constant regardless of the values of the explanatory variables. Judging from the residuals vs. fitted values plot, this seems to be the case: there is no distinct pattern between the residuals and the fitted values. The third assumption is that the residuals are normally distributed. According to the Q-Q plot, this assumption seems to hold, as the residuals fit the theoretical (normal) distribution relatively well.

The fourth assumption of the model is independence between observations, which cannot be examined through diagnostic plots. We can assume that the attitudes and exam scores are independent of each other for the most part. There seems to be a small cluster of individuals with high attitude scores and low exam scores, which one could speculate to be a group of friends who like statistics but whose previous night was a bit too long. That would be a violation of the independence assumption. The residuals vs. leverage plot suggests that no single observation has a major impact on the regression results, which is reassuring in terms of the model’s sensitivity to these outliers. All in all, the model seems to be a relatively good one. Whether it is really useful is another question.


Week 3

Introduction

This week’s assignment focuses on the risk factors of high alcohol use among secondary school students in Portugal. I am using data collected from two schools in Portugal (Cortez and Silva 2008). The data include information on weekly alcohol use, grades in Mathematics and Portuguese, as well as several background variables, such as parental education. I will start the analysis by exploring the dataset graphically, then move on to logistic regression to describe the relevant associations, and finally assess the model’s validity with cross-validation.

#Load packages
library(tidyverse)
library(GGally)
library(patchwork)
library(openxlsx)
#Read in the data
setwd("~/IODS-project/data")
alc_use <- read.xlsx("alc_use.xlsx")

Exploratory analysis

I have modified the data by combining weekday and weekend alcohol use and by constructing a binary variable indicating high alcohol use (average daily alcohol consumption > 2). Based on previous knowledge of the topic, I assume that males, students with less educated parents, students with problematic family relations and students who go out a lot are likely to be at increased risk of high alcohol use.
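The construction described above can be sketched with a minimal toy example (Dalc and Walc are the weekday and weekend consumption columns of the original Cortez & Silva data, both on a 1-5 scale):

```r
# Toy version of the data modification: average weekday (Dalc) and
# weekend (Walc) alcohol use, then flag averages above 2 as high use
toy <- data.frame(Dalc = c(1, 4), Walc = c(2, 5))

toy$alc_use  <- (toy$Dalc + toy$Walc) / 2
toy$high_use <- toy$alc_use > 2

toy
```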

First, let’s look at parental education. Education is available for both parents, but I will use the highest education of either parent.

#Generate variable paredu
#=highest education of either parent

alc_use$paredu <-
  ifelse(alc_use$Fedu>alc_use$Medu,
         alc_use$Fedu,alc_use$Medu)

#Modify into factor
alc_use$paredu <-
  as.factor(alc_use$paredu)

#Plot means of high use 
#by parental education

alc_use %>% 
  group_by(paredu) %>%
  summarise(mean_high_use=mean(high_use),
            n=n())%>%
  ggplot(aes(x=paredu,
             y=mean_high_use,
             size=n,
             col=paredu)) +
  geom_point() +
  theme_minimal()
Means of high alcohol use by parental education. Size represents the number of observations

prop.table(table(alc_use$paredu))
## 
##          1          2          3          4 
## 0.09189189 0.23783784 0.25675676 0.41351351
prop.table(table(alc_use$paredu,alc_use$high_use),1)
##    
##         FALSE      TRUE
##   1 0.6764706 0.3235294
##   2 0.7500000 0.2500000
##   3 0.6526316 0.3473684
##   4 0.7058824 0.2941176
chisq.test(alc_use$paredu,alc_use$high_use)
## 
##  Pearson's Chi-squared test
## 
## data:  alc_use$paredu and alc_use$high_use
## X-squared = 2.1775, df = 3, p-value = 0.5364

From the figure we see that there is no clear pattern between parental education and high alcohol use. The highest proportion of high alcohol use is at level 3 and the lowest at level 2. The Chi-squared test also indicates that there is no statistically significant relationship between the two variables. Moreover, low parental education is relatively uncommon among the students (9%).

Second, I’ll look at the relationship between the quality of family relations and high alcohol use.

alc_use %>%
  ggplot(aes(x=high_use,
             y=famrel)) +
  geom_violin() +
  geom_jitter() +
  theme_minimal()
Quality of family relations and high alcohol consumption. Violin plot

prop.table(table(alc_use$famrel))
## 
##          1          2          3          4          5 
## 0.02162162 0.04864865 0.17297297 0.48648649 0.27027027

The violin plot indicates that the distribution of the quality of family relations is skewed towards the high values (good relations). In addition, no large differences seem to be present between the high-use and low-use groups.

However, looking at the distributions more closely, around 80% of those without high alcohol consumption have responded that their family relations are very good or excellent (4 or 5). Among those with high consumption, these high values are slightly less common. Hence, I transform the variable into a binary one where 1 = very good relations and 0 = lower.

prop.table(table(alc_use$famrel,alc_use$high_use),2)
##    
##          FALSE       TRUE
##   1 0.02316602 0.01801802
##   2 0.03474903 0.08108108
##   3 0.15057915 0.22522523
##   4 0.49420849 0.46846847
##   5 0.29729730 0.20720721
alc_use$famrel_b <-
  ifelse(alc_use$famrel>3,1,0)

table(alc_use$famrel_b,alc_use$high_use)
##    
##     FALSE TRUE
##   0    54   36
##   1   205   75
prop.table(table(alc_use$famrel_b,alc_use$high_use),1)
##    
##         FALSE      TRUE
##   0 0.6000000 0.4000000
##   1 0.7321429 0.2678571
chisq.test(alc_use$famrel_b,alc_use$high_use)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  alc_use$famrel_b and alc_use$high_use
## X-squared = 5.0515, df = 1, p-value = 0.0246

With this binary indicator, there is a statistically significant (see the Chi-squared test) relationship between lower-quality family relations and alcohol use. The probability of high use among those with lower-quality relations is around 13 percentage points higher (40% vs. 27%).

Finally, let’s look at the last two explanatory variables, sex and going out with friends.

ggplot(data=alc_use,
  aes(x=high_use,y=goout))+
  geom_violin()

prop.table(table(alc_use$goout,alc_use$high_use),2)
##    
##          FALSE       TRUE
##   1 0.07335907 0.02702703
##   2 0.31660232 0.13513514
##   3 0.37451737 0.20720721
##   4 0.15444015 0.34234234
##   5 0.08108108 0.28828829
table(alc_use$sex,alc_use$high_use)
##    
##     FALSE TRUE
##   F   154   41
##   M   105   70
prop.table(table(alc_use$sex,alc_use$high_use),1)
##    
##         FALSE      TRUE
##   F 0.7897436 0.2102564
##   M 0.6000000 0.4000000
chisq.test(alc_use$sex,alc_use$high_use)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  alc_use$sex and alc_use$high_use
## X-squared = 14.921, df = 1, p-value = 0.0001121

Of these variables, going out clearly has a positive association with high alcohol use: among high users, the distribution of going out is shifted towards the higher values (see the violin plot as well as the table of percentages). In addition, being male increases the risk of high use: 40% of males and 21% of females are high alcohol users.

Logistic regression

My initial model consists of parental education, binary family relations indicator, continuous going out variable and sex of the student:

model_1 <- 
  glm(high_use ~ 
        paredu + famrel_b + sex + goout,
      family = binomial,
      data=alc_use)
summary(model_1)
## 
## Call:
## glm(formula = high_use ~ paredu + famrel_b + sex + goout, data = alc_use)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6052  -0.3183  -0.1386   0.3627   1.0720  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.03823    0.09877  -0.387  0.69890    
## paredu2     -0.12227    0.08427  -1.451  0.14764    
## paredu3     -0.01163    0.08331  -0.140  0.88905    
## paredu4     -0.08778    0.07931  -1.107  0.26910    
## famrel_b    -0.16560    0.05059  -3.273  0.00117 ** 
## sexM         0.17968    0.04365   4.116 4.77e-05 ***
## goout        0.14342    0.01925   7.450 6.89e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1728196)
## 
##     Null deviance: 77.700  on 369  degrees of freedom
## Residual deviance: 62.734  on 363  degrees of freedom
## AIC: 409.41
## 
## Number of Fisher Scoring iterations: 2

The summary of the model indicates that, compared to the lowest level of parental education, all higher categories are associated with lower odds of high alcohol use. However, the differences are not statistically significant. Being male and going out a lot increase the odds of high alcohol use, and having good family relations decreases the odds. Based on this initial model, I will leave parental education out of my final model.

model_2 <- 
  glm(high_use ~ 
        famrel_b + sex + goout,
      family = binomial,
      data=alc_use)

OR <- exp(coef(model_2))
CI <- exp(confint(model_2))

cbind(OR,CI)
##                    OR     2.5 %    97.5 %
## (Intercept) 0.8979596 0.7761920 1.0388300
## famrel_b    0.8533955 0.7728747 0.9423051
## sexM        1.1888255 1.0916995 1.2945926
## goout       1.1537940 1.1110075 1.1982283

In my final model, a one-step increase in going out increases the odds of high alcohol use by 15% (OR = 1.15, 95% CI: 1.11-1.20). Being male increases the odds of high alcohol use by 19% (OR = 1.19, 95% CI: 1.09-1.29), and having good family relations decreases the odds by 15% (OR = 0.85, 95% CI: 0.77-0.94). Since the confidence intervals do not include 1, all these associations are statistically significant at the 95% confidence level.

Model predictions

probs <- predict(model_2,
                 type = "response")
alc_use <- 
  mutate(alc_use, probability = probs)

alc_use <- 
  mutate(alc_use,
         predicted_high=probability>0.5)

#table
table(
  high_use = alc_use$high_use,
  prediction = alc_use$predicted_high)
##         prediction
## high_use FALSE TRUE
##    FALSE   253    6
##    TRUE     76   35
#Proportions
table(
  high_use = alc_use$high_use,
  prediction = alc_use$predicted_high) %>%
  prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.68378378 0.01621622 0.70000000
##    TRUE  0.20540541 0.09459459 0.30000000
##    Sum   0.88918919 0.11081081 1.00000000
#Graphical assessment
ggplot(alc_use, 
       aes(x = probability,
           y = high_use,
           col=predicted_high)) +
  geom_point(size=2)

# loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

#average number of wrong predictions

loss_func(class = alc_use$high_use,
          prob = alc_use$probability)
## [1] 0.2216216

The average proportion of inaccurately classified individuals is around 22%. The model predicts that only 11% would be high users, whereas in the observed data 30% were identified as high users. On the other hand, almost all of the high-use predictions were true high users. The estimates therefore seem conservative, which is reassuring if any inference were to be made. Erroneous TRUE predictions occurred mostly at lower levels of predicted probability (see the Figure).
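To quantify this, sensitivity (the share of true high users the model catches) and specificity (the share of non-high users correctly classified) can be computed from the confusion matrix above:

```r
# Confusion matrix from the 2x2 table above
# (rows: observed high_use, columns: prediction)
cm <- matrix(c(253, 76, 6, 35), nrow = 2,
             dimnames = list(high_use   = c("FALSE", "TRUE"),
                             prediction = c("FALSE", "TRUE")))

sensitivity <- cm["TRUE", "TRUE"] / sum(cm["TRUE", ])    # 35 / 111
specificity <- cm["FALSE", "FALSE"] / sum(cm["FALSE", ]) # 253 / 259
round(c(sensitivity = sensitivity, specificity = specificity), 2)
```

The model is thus highly specific but not very sensitive, which matches the conservative behaviour noted above.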

I am not sure how to compare this to a simple guessing strategy, but a natural benchmark is to always guess the majority class: since 30% of the students are high users, always predicting “not a high user” would be wrong 30% of the time. With a prediction error of around 22%, the model is therefore somewhat better than guessing.
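The guessing benchmark can be checked with the same loss function as above: always predicting the majority class (probability 0, i.e. “not a high user”) is wrong exactly as often as the observed share of high users.

```r
# Loss of a trivial majority-class strategy, using the same loss
# function as above; high_use is a toy vector reproducing the 30%
# share of high users observed in the data
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

high_use <- rep(c(TRUE, FALSE), times = c(30, 70))

loss_func(class = high_use, prob = 0)
```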

10-fold cross-validation

library(boot)
cross_validation <-
  cv.glm(data = alc_use, 
         cost = loss_func, 
         glmfit = model_2, K = 10)

cross_validation$delta[1]
## [1] 0.227027

My model seems to have better test performance than the model in DataCamp: my average prediction error is around 23-24% (depending on the random folds), while in the DataCamp model it was 26%.

Finally, I will try including different sets of predictors and see what happens to the training and testing errors. I will start with a large model with 10 predictors and move down towards 1.

m_1_formula <- 
  as.formula('high_use~
             sex + age +
             famrel_b + goout +
             paredu + health +
             absences + romantic +
             studytime + 
             school')
m_2_formula <- 
  as.formula('high_use~
             sex + age +
             famrel_b + goout +
             paredu + health +
             absences + romantic +
             school')
m_3_formula <- 
  as.formula('high_use~
             sex + age +
             famrel_b + goout +
             paredu + health +
             absences + 
             school')
m_4_formula <- 
  as.formula('high_use~
             sex + age +
             famrel_b + goout +
             paredu + health +
             school')
m_5_formula <- 
  as.formula('high_use~
             sex + age +
             famrel_b + goout +
             paredu + school')
m_6_formula <- 
  as.formula('high_use~
             sex + age +
             famrel_b + goout +
             school')
m_7_formula <- 
  as.formula('high_use~
             sex + age +
             famrel_b + goout')
m_8_formula <- 
  as.formula('high_use~
             sex + age +
             famrel_b')
m_9_formula <- 
  as.formula('high_use~
             sex + age')
m_10_formula <- 
  as.formula('high_use~
             sex')
library(purrr)
#is_formula(m_10_formula)

model_vector <- c(
  m_1_formula,
  m_2_formula,
  m_3_formula,
  m_4_formula,
  m_5_formula,
  m_6_formula,
  m_7_formula,
  m_8_formula,
  m_9_formula,
  m_10_formula)

#is_formula(model_vector[[1]])

error_mat <- matrix(NA, nrow = 2, ncol = 10)

for (i in 1:10) {
  alc_temp <- alc_use
  current_model <- model_vector[[i]]
  
  current_glm <- glm(current_model,
                     family = 'binomial',
                     data = alc_temp)
  
  probs <- predict(current_glm,
                   type = "response")
  alc_temp <- 
    mutate(alc_temp, probability = probs)
  
  alc_temp <- 
    mutate(alc_temp,
           predicted_high = probability > 0.5)
  
  # Training error: average number of wrong predictions
  error_mat[1, i] <- 
    loss_func(class = alc_temp$high_use,
              prob = alc_temp$probability)
  
  # Testing error from 10-fold cross-validation
  cross_validation <-
    cv.glm(data = alc_temp, 
           cost = loss_func, 
           glmfit = current_glm, K = 10)
  
  error_mat[2, i] <-
    cross_validation$delta[1]
}

labels <- c("10","9","8","7","6",
     "5","4","3","2","1")

plot(error_mat[1,],type='l',
     col='blue',lwd=2,xaxt='n',
     xlab="Number of predictors",
     ylab="Size of error")
lines(error_mat[2,],col='orange',
      lwd=2)
axis(side=1,labels=labels,at=1:10)
legend(1,y=0.3,legend=c("Training error",
                "Testing error"),
       col=c('blue','orange'),
       lty=c(1,1))

Based on these investigations, it seems that as the number of predictors increases, both the training and the testing error get smaller. The training error, i.e. the average number of wrong predictions in the data the model was fitted on, is smaller than the testing error from the 10-fold cross-validation. When the number of predictors gets smaller (below 4), both errors increase and move closer together.

Data reference: P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7. https://archive.ics.uci.edu/ml/datasets/Student+Performance


Week 4

Introduction

Today’s exercise will focus on different techniques of clustering and classification. I will use data on housing in areas of Boston and mostly focus on the crime rate in the city. The data can be accessed through the R package MASS. The data contain area-level information on the characteristics of homes (size, value etc.), the demographic composition of the areas, and several variables related to environmental and infrastructural factors. More information on the data is available here and in the original study by Harrison & Rubinfeld (1978).

BTW, I decided to hide my code chunks by default in this course diary as they make reading a bit tedious. You should be able to get the codes by clicking the Code button next to results.

Overview of the dataset

library(MASS)
library(tidyverse)
library(GGally)
library(corrplot)
data(Boston)
dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
hist(Boston$crim)

The data include 14 variables from 506 areas in Boston; all of them are numeric. My main variable of interest, the crime rate, is highly skewed: most areas have low rates of crime and a few express considerably higher rates.

p.values.mat <-cor.mtest(Boston,
                         conf.level = .95)
cor.mat <- cor(Boston)
corrplot.mixed(cor.mat,
         lower.col='black',
               upper='color',
               tl.col='black',
               tl.cex=0.5,
               number.cex=0.5,
               p.mat=p.values.mat$p,
               sig.level=0.05)
Correlation plot

Correlation plot

For a graphical overview of the data, I am using a correlation plot to keep the figure reasonably readable (compared to pairs or ggpairs). In the plot, I have crossed out all correlations not significant at the 95% confidence level. The variable chas is not significantly correlated with most of the variables (probably because it is binary). The variable indicates whether the area is bounded by the Charles River, which makes little sense to me as a predictor.

Most of the other variables are significantly correlated, and some correlations are relatively strong: for instance, property tax rate (tax) and access to radial highways (rad) have a correlation of 0.91, the age of houses (age) and the distance to employment centres (dis) have a correlation of -0.75, and nitrogen oxide concentration (nox) and the proportion of non-retail business (indus) have a correlation of 0.76. These are rather intuitive. Most of the correlations seem moderate, between 0.3 and 0.6. The highest correlations with the crime rate occur for access to radial highways and property tax rate.

Linear Discriminant analysis

As the next part of the assignment, I am running a Linear Discriminant Analysis (LDA). LDA classifies observations into pre-known categories based on linear combinations of the variables in the data. The model makes two assumptions: the variables are normally distributed, and they share the same variance across classes. Thus, the process starts by scaling the data. The effect of scaling can be seen in the following summary: all variables are now centered around a mean of 0.
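Concretely, scale() subtracts each column’s mean and divides by its standard deviation, which can be verified by hand on a small vector:

```r
# scale() standardizes a column: (x - mean(x)) / sd(x)
x <- c(2, 4, 6, 8)

manual <- (x - mean(x)) / sd(x)
isTRUE(all.equal(as.vector(scale(x)), manual))
```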

#Scale boston data

boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

Next, I will calculate target classes for the LDA model. I use the crime rate in the areas, divided into quartiles. Then, I split the data into train and test sets by randomly sampling 80% of the areas (train); the rest of the areas form the test set. Lastly, I fit the LDA model on the training data and test it by predicting the classes of the test set.

#Save categories of crime rate
bins <- quantile(boston_scaled$crim)
#Create new crime variable
crime <- cut(boston_scaled$crim,
             breaks=bins,
             include.lowest=TRUE,
             labels=c(
               "low",
               "med_low",
               "med_high",
               "high"))
boston_scaled$crim <- NULL
boston_scaled$crime <- crime

#Divide data into test and training sets
n <- nrow(boston_scaled)

#Randomly sample 80% of the original rows
#These are used for training
ind <- sample(n, size=n*0.8)

#Train set
train <- boston_scaled[ind,]
#Test set
test <- boston_scaled[-ind,]

#Correct classes in the test set
correct <- test$crime

#Drop crime from test
test <- dplyr::select(test, -crime)

#LDA model
#(lda comes from the MASS package)
library(MASS)
lda.fit <- 
  lda(crime~., data=train)

# the function for lda biplot arrows
#(Stolen from Datacamp)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# predict classes with test data
lda.pred <- predict(lda.fit,
                    newdata = test)

Let’s have a look at the results.

# plot the lda results
plot(lda.fit, dimen = 2,
     col=classes,
     pch=classes)
lda.arrows(lda.fit, myscale = 2)

#summary of the model
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2599010 0.2549505 0.2549505 0.2301980 
## 
## Group means:
##                   zn      indus         chas        nox         rm        age
## low       0.97425764 -0.9238894 -0.159840490 -0.8909298  0.3940558 -0.8910670
## med_low  -0.06533749 -0.3495082 -0.004759149 -0.5854509 -0.1266602 -0.3574322
## med_high -0.36984842  0.1672645  0.109913674  0.3394447  0.2133584  0.4279554
## high     -0.48724019  1.0173143 -0.060657012  1.0729111 -0.3750647  0.8201536
##                 dis        rad        tax     ptratio       black       lstat
## low       0.8970000 -0.6810822 -0.7500341 -0.37010071  0.38130833 -0.75171116
## med_low   0.4183290 -0.5358646 -0.4911728 -0.09785201  0.36220572 -0.14217722
## med_high -0.4007176 -0.4165578 -0.3193919 -0.27678429  0.09517382 -0.05039298
## high     -0.8336634  1.6361396  1.5126335  0.77846144 -0.66909921  0.86691596
##                 medv
## low       0.49638330
## med_low   0.01258411
## med_high  0.25379554
## high     -0.63829230
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.08266125  0.57265734 -1.03898336
## indus    0.06076957 -0.33109926  0.17129115
## chas    -0.08357570  0.02233658  0.18999452
## nox      0.36515510 -0.78642298 -1.22127540
## rm      -0.12574264 -0.12835791 -0.15686805
## age      0.25640497 -0.35565920 -0.11522901
## dis     -0.04658707 -0.16942276  0.29618209
## rad      3.10017827  1.01642852 -0.02977765
## tax      0.07246627  0.03044068  0.58900645
## ptratio  0.07149625  0.01344685 -0.40525274
## black   -0.10788396  0.02507596  0.20287249
## lstat    0.22105335 -0.13167025  0.46167653
## medv     0.19430092 -0.37498807 -0.16441355
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9418 0.0437 0.0145
# cross tabulate 
#the correct classes vs. predictions
table(correct = correct, 
      predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       11       9        2    0
##   med_low    3      12        8    0
##   med_high   0      10       12    1
##   high       0       0        0   34
table(correct = correct, 
      predicted = lda.pred$class) %>%
  prop.table(2) %>% round(digits=2)
##           predicted
## correct     low med_low med_high high
##   low      0.79    0.29     0.09 0.00
##   med_low  0.21    0.39     0.36 0.00
##   med_high 0.00    0.32     0.55 0.03
##   high     0.00    0.00     0.00 0.97

The most relevant information is included in the scatterplot, aka biplot, of the first two linear discriminants. The plot shows that the grouping of high crime rate areas with these discriminants was relatively successful, whereas the other groups overlap more. The arrows in the plot show the importance of each variable and the direction in which they work. Access to radial highways (rad) separates the high crime rate group relatively well. The summary of the model shows the same: the mean of rad is high in the high crime rate group and lower in the others. The rad variable also has a large coefficient for LD1. These findings indicate that crime occurs in places with easy access to highways, or, rather, in places where people come and go (probably city centres and various knots in the public transportation system).

When we look at the predictions made with the test data, we see, as in the biplot, that the model classifies the high crime rate areas mostly correctly but struggles more with the others. This is probably related to the skewness of the crime variable: the lowest three classes are relatively similar, while the high crime rate group is a somewhat special case.
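The overall accuracy can be read off the confusion matrix above; a quick sanity check with the counts copied from the printout (the exact numbers vary with the random train/test split):

```r
# Confusion matrix counts copied from the cross-tabulation above
conf <- matrix(c(11,  9,  2,  0,
                  3, 12,  8,  0,
                  0, 10, 12,  1,
                  0,  0,  0, 34),
               nrow = 4, byrow = TRUE,
               dimnames = list(
                 correct   = c("low", "med_low", "med_high", "high"),
                 predicted = c("low", "med_low", "med_high", "high")))

# Share of test areas classified correctly (diagonal over total)
sum(diag(conf)) / sum(conf)
```

Here roughly two thirds of the test areas land in the correct quartile, with the high class driving most of the correct predictions.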

K-means clustering

Okay then, let’s move on from classification to clustering. I will run a k-means algorithm to identify clusters in the Boston data. I start by calculating the Euclidean distances between observations and summarizing them, as the k-means algorithm builds on these distances. Second, I run the k-means algorithm for different cluster numbers, ranging from 1 to 15, and calculate the Total Within Cluster Sum of Squares (TWCSS) for each. Usually, a suitable number of clusters is where the TWCSS has its largest drop.

#Reload and rescale boston
data(Boston)
boston_scaled <-
  as.data.frame(scale(Boston))

#Calculate distances between observations
#I use Euclidean for no specific reason
#except that the km algorithm uses it
#by default

distances <- dist(boston_scaled)
summary(distances)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
#Identify correct number of clusters
#Use the TWCSS for this purpose

k_max <- 15 #Arbitrary number

twcss <-
  sapply(1:k_max,
         function(k){
           kmeans(
             boston_scaled,k)$tot.withinss})
plot(x=1:k_max,y=twcss,type='l')

#2 seems appropriate

According to the plot, two seems to be the correct number of clusters. I will select that and repeat the k-means algorithm with 2 clusters.

#Run k-means algorithm

km <- kmeans(boston_scaled,centers=2)

#Plot the data set
pairs(boston_scaled[
  c(1,5,6,7,8,12,13,14)], 
        col=km$cluster)

Above, I have selected some variables that seem to show reasonable patterning in the data. The scatter plots show the differences between the two clusters identified with the k-means. Indeed, the clusters appear more or less feasibly separated for each of the variables.
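The cluster centers give a numeric counterpart to the scatter plots; a sketch that re-runs the clustering (the seed is my addition, so the exact assignment may differ from the plots above):

```r
library(MASS)
data(Boston)
boston_scaled <- as.data.frame(scale(Boston))

set.seed(123)  # seed added here for reproducibility only
km <- kmeans(boston_scaled, centers = 2)

round(km$centers, 2)  # per-cluster means of the scaled variables
table(km$cluster)     # cluster sizes
```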

K-means + LDA

Now, I will combine the k-means algorithm with the LDA. I first identify three target classes with k-means to be used in the LDA, and then fit the model.

km2 <- kmeans(boston_scaled,centers=3)
boston_scaled$km_clust <- km2$cluster

#LDA model (rename to avoid confusion
#in the next step)
lda.fit2 <- 
  lda(km_clust~., data=boston_scaled)

lda.fit2
## Call:
## lda(km_clust ~ ., data = boston_scaled)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.2094862 0.3241107 0.4664032 
## 
## Group means:
##         crim         zn     indus        chas        nox          rm
## 1 -0.4075892  1.5146367 -1.068814 -0.04947434 -0.9962503  0.92400834
## 2  0.8046456 -0.4872402  1.117990  0.01575144  1.1253988 -0.46443119
## 3 -0.3760908 -0.3417123 -0.296848  0.01127561 -0.3345884 -0.09228038
##           age         dis        rad        tax     ptratio      black
## 1 -1.16762641  1.19486951 -0.5983266 -0.6616391 -0.74378342  0.3538816
## 2  0.79737580 -0.85425848  1.2219249  1.2954050  0.60580719 -0.6407268
## 3 -0.02966623  0.05695857 -0.5803944 -0.6030198 -0.08691245  0.2863040
##        lstat        medv
## 1 -0.9480974  0.97889973
## 2  0.8719904 -0.68418954
## 3 -0.1801190  0.03577844
## 
## Coefficients of linear discriminants:
##                 LD1         LD2
## crim    -0.03134296 -0.14880455
## zn      -0.06381527 -1.22350515
## indus    0.61086696 -0.10402980
## chas     0.01953161  0.03579238
## nox      1.00230143 -0.70464917
## rm      -0.16285767 -0.44390394
## age     -0.07220634  0.59785382
## dis     -0.04270475 -0.45498614
## rad      0.71987743 -0.02882054
## tax      0.98285440 -0.70663319
## ptratio  0.22527977 -0.15514668
## black   -0.01693595  0.03181845
## lstat    0.18274033 -0.50122677
## medv    -0.02892966 -0.64244841
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8409 0.1591
classes <- 
  as.numeric(boston_scaled$km_clust)

# plot the lda results
plot(lda.fit2, dimen = 2,
     col=classes,
     pch=classes)
lda.arrows(lda.fit2, myscale = 2)

As can be seen from the biplot, the three identified clusters work relatively nicely with the LDA. We can see three clusters separated relatively well by the variables. The most important separators seem to be access to radial highways, the age of the building stock and the property tax rate. The tax rate operates in the same direction as access to highways, indicating that there might be more expensive housing near good travelling options. Old houses seem to form their own cluster.

Plotly fun

As the last part of the analysis, I am comparing the original LDA with the k-means solution with two clusters. The plots project the data points based on the LDA, rather than on the actual observations. In the first plot, the colors represent the four original classes, and in the second, the two k-means clusters.

model_predictors <- 
  dplyr::select(train, -crime)


# matrix multiplication
matrix_product <- 
  as.matrix(model_predictors) %*% lda.fit$scaling

matrix_product <- as.data.frame(matrix_product)

library(plotly)

plot_ly(x = matrix_product$LD1,
        y = matrix_product$LD2,
        z = matrix_product$LD3,
        type= 'scatter3d',
        mode='markers',
        color=train$crime)
#let's fit still another k-means
#For some reason, the code does not work
#with the original train data
#However, this just reloads the Boston data
#and limits it to the same areas so no
#errors here, I guess

data(Boston)
boston_scaled <- as.data.frame(scale(Boston))
train <- boston_scaled[ind,]
km3 <- kmeans(train, centers=2)

plot_ly(x = matrix_product$LD1,
        y = matrix_product$LD2,
        z = matrix_product$LD3,
        type= 'scatter3d',
        mode='markers',
        color=km3$cluster)

We can see from the plots that the original crime rate is reflected in the k-means clusters: areas with high or medium-high crime rates form one cluster, and those with lower rates another. Using the k-means clusters as target classes would likely produce a better model than the original solution with four target groups.
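That impression can be checked with a cross-tabulation of the crime quartiles against the two clusters; a sketch on the full scaled data rather than the train set, with a seed added for reproducibility:

```r
library(MASS)
data(Boston)
boston_scaled <- as.data.frame(scale(Boston))

# Crime quartiles as in the LDA section
crime <- cut(boston_scaled$crim,
             breaks = quantile(boston_scaled$crim),
             include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))

set.seed(123)  # seed added for reproducibility only
km <- kmeans(boston_scaled, centers = 2)

# Do the clusters follow the crime quartiles?
table(crime, cluster = km$cluster)
```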

References

Harrison, D. & Rubinfeld, D.L. (1978). Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5(1), 81-102.


Week 5

Introduction

The fifth week of the course focuses on different dimension reduction techniques. I will be using Principal Component Analysis (PCA) and Multiple Correspondence Analysis (MCA). For practicing these methods, I am doing a few tasks with the United Nations Development Programme data from Human Development Index (HDI) and Gender Inequality Index (GII) databases. The data covers countries across the globe. Metadata are available here.

Graphical overview

library(openxlsx)
setwd("~/IODS-project/data/")
human <- read.xlsx("human.xlsx")

library(GGally)
library(tidyverse)
library(corrplot)
summary(human)
##    edu_ratio        lab_ratio         exp_edu         life_exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       gni          maternal_mx      teen_births      female_parl   
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
ggpairs(human,
        upper = list(continuous =
                       wrap("cor", size =
                              3)))

corrplot(cor(human))

From the summary above, we can see that all the variables in the data are continuous. We have information on demographic characteristics, including life expectancy (life_exp), maternal mortality (maternal_mx) and the teen birth rate (teen_births). Socioeconomic indicators related to gender equality include the ratio of the female to the male labour force participation rate (lab_ratio) and the ratio of the share of females with secondary education to the corresponding share of males (edu_ratio). Lastly, we have information on the proportion of females in parliament (female_parl), gross national income (gni) and the expected number of years of education (exp_edu). The data are available for 195 countries.

We can start to identify associations between these variables from the ggpairs output and the correlation plot above. First, we see that the demographic indicators, overall life expectancy, maternal mortality rate and teen birth rate, are strongly correlated with each other. They are also correlated with GNI, the expected number of years of education and gender equality in education; life expectancy correlates positively with these indicators, whereas maternal mortality and teen births correlate negatively. Gender equality in the labour force or in parliament does not seem to correlate strongly with any of the variables.

Principal component analysis

Given that the data describe multiple aspects of societies, identifying bivariate associations is somewhat uninteresting. Therefore, I will perform a PCA to identify whether the indicators presented above belong to the same dimensions and whether those dimensions have meaningful relationships with each other. I will start by running the analysis on the unmodified data.

#PCA
pca_human <- prcomp(human)
#summary
summary(pca_human)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
#Save proportions of variance explained
proportions <-
  round(
    summary(pca_human)$importance[2,]*100,
    digits=3)

pc_lab <- 
  paste0(
  names(proportions)," (", proportions, "%)")
#biplot
biplot(pca_human, 
       choices=1:2,
       cex=c(0.5,0.4),
       xlab=pc_lab[1],
       ylab=pc_lab[2])
PCA with unstandardized data. GNI explains all of the variability

Okay then, from the summary of the model we see that it identified 8 principal components (the number of variables in the data), but the first of these explains 99.99% ≈ 100% of the variation. If we look at the biplot, the only important component seems to be gross national income (I guess many politicians use this type of PCA in their reasoning). The fact that GNI overrides all the other variables stems from the unmodified data: the variables have very different variances, and PCA treats the variable with the largest variance as the most important one. Therefore, to identify the real dimensions, I need to scale the data and run the analysis again.
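The mechanics are easy to demonstrate: prcomp maximizes variance, so whichever raw variable has the largest variance dominates PC1. Since human.xlsx is a local file, the same effect is sketched here on the Boston data, where tax plays the role of GNI:

```r
library(MASS)
data(Boston)

# The raw variances differ by orders of magnitude...
sort(apply(Boston, 2, var), decreasing = TRUE)[1:3]

# ...so the unscaled PCA is dominated by the highest-variance column
pca_raw <- prcomp(Boston)
summary(pca_raw)$importance[2, 1]  # proportion of variance on PC1
```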

#scale data
human_sc <- scale(human)

pca_human_sc <- prcomp(human_sc)

summary(pca_human_sc)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
proportions <-
  round(
    summary(pca_human_sc)$importance[2,]*100,
    digits=3)

pc_lab <- 
  paste0(
  names(proportions)," (", proportions, "%)")

biplot(pca_human_sc, 
       choices=1:2,
       cex=c(0.5,0.4),
       xlab=pc_lab[1],
       ylab=pc_lab[2])
PCA with scaled data. First principal component measuring overall economic and vital well-being, second gender equality

Now, a much more interesting plot is produced. We can see that the sociodemographic indicators education, GNI, life expectancy, maternal mortality and teen births load on the first principal component, which explains over 50% of the total variability in the data. We also see that maternal mortality and teen births operate in the opposite direction compared to the other factors. This makes sense, as it would be strange if, for instance, GNI increased with increasing rates of maternal mortality and teen births. These correlations were already identified above in the graphical overview.

Second, the new PCA produced another distinct principal component, which seems to describe gender equality. The gender ratio in the labour market and the proportion of females in parliament relate to this component. This dimension seems genuinely distinct from the first, as the arrows of its variables are at almost a 90 degree angle (meaning low correlation) to the indicators driving dimension one. I might interpret this to indicate that gender equality in the labour market and parliament is not related to economic and “vital” well-being in a society; instead, other factors (maybe values, attitudes etc.) are at play. A surprising finding is that gender equality in education does not seem to belong to the gender equality component. However, this is probably because the variable only covers secondary education. It might be that for overall increases in well-being, it is necessary to have a population where each member has at least some education; differences might emerge if tertiary education were used as the measure instead.
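The loadings behind the biplot can also be read numerically from the rotation matrix; for the human data that would be round(pca_human_sc$rotation[, 1:2], 2). Since the file is local, the same inspection is sketched here on R's built-in USArrests data:

```r
# PCA on standardized data; rotation holds the variable loadings
pca <- prcomp(USArrests, scale. = TRUE)

# Which variables load on PC1 vs PC2
round(pca$rotation[, 1:2], 2)
```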

Multiple Correspondence Analysis

As the second part of this assignment, I am applying MCA to a pre-existing dataset called tea, available in the R package FactoMineR. The data contain information on the tea drinking habits of 300 individuals. I chose the first six variables from the data set, which measure when these people drink tea, and try to identify similar components as in the PCA, but this time for categorical data.

library(FactoMineR)
data(tea)

tea_plc_time <-
  tea[,1:6]

mca <- MCA(tea_plc_time,graph=F)

plot(mca,habillage="quali",invisible=c("ind"))

The MCA calculates distances between variable categories and individuals in a multidimensional space (I think, at least). In the plot above, the distances between the variable categories along the first two dimensions are shown. We can see that categories opposite to each other (no/yes) are plotted in opposite quadrants of the plot. Second, similar categories are plotted close to each other (for instance not breakfast and not tea time). Third, categories that are well captured by the dimensions lie further from the center of the plot than others. Especially dinner and lunch seem to be well distinguished. We can also confirm this by looking at the bar plots of the variables: there is clearly a relatively small group of lunch or dinner drinkers.

data(tea)

gather(tea[,1:6]) %>% 
  ggplot(aes(value)) + 
  facet_wrap("key",scales="free") + 
  geom_bar()

Let’s continue the analysis by adding the individuals to the plot:

plot(mca,habillage="quali")

Now, the individuals whose tea drinking habits are characterized by drinking tea during lunch and evenings are plotted in the upper left corner. In the upper right corner we have those individuals who apparently do not drink tea at all (see also the plot above to better distinguish the categories). The lower right corner represents tea drinkers who limit their consumption to dinner time. Finally, the lower left corner includes individuals who want to preserve a good night’s sleep and only drink tea in the mornings and during tea time. The edgiest group seems to be those drinking with dinner, as they do not tolerate tea at any other time (given that the interpretation is correct, which it may well not be).